Using HuggingFace Datasets in evaluations with preprocess_model_input
Note: This is a temporary workaround
This guide demonstrates a workaround for using HuggingFace Datasets with Weave evaluations.
We are actively developing more seamless integrations that will simplify this process. The approach below works today, but expect updates in the near future that will make working with external datasets more straightforward.
Setup and imports
First, we initialize Weave and connect to Weights & Biases for tracking experiments.
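A minimal sketch of this setup cell; the project name is a placeholder, so substitute your own W&B entity/project:

```python
import weave

# Initialize Weave so that subsequent ops and evaluations are tracked
# in Weights & Biases. "hf-weave-evals" is a placeholder project name.
weave.init("hf-weave-evals")
```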
Load and prepare HuggingFace dataset
- We load a HuggingFace dataset.
- We create an index mapping to reference the dataset rows.
- This index approach allows us to maintain references to the original dataset.
Note:
In the index, we encode the hf_hub_name along with the hf_id to ensure each row has a unique identifier. This unique digest value is used for tracking and referencing specific dataset entries during evaluations.
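A sketch of what this loading and indexing step can look like. The dataset (openai/gsm8k) is just an example; any HuggingFace dataset works the same way:

```python
from datasets import load_dataset

hf_hub_name = "openai/gsm8k"  # example dataset; substitute your own
hf_dataset = load_dataset(hf_hub_name, "main", split="test")

# Build a lightweight index instead of copying the dataset into Weave.
# Encoding both the hub name and the row id makes the content-based
# digest Weave computes for each row unique across datasets.
hf_index = [
    {"hf_hub_name": hf_hub_name, "hf_id": i}
    for i in range(len(hf_dataset))
]
```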
Define processing and evaluation functions
Processing pipeline
- preprocess_example: Transforms the index reference into the actual data needed for evaluation.
- hf_eval: Defines how to score the model outputs.
- function_to_evaluate: The actual function/model being evaluated.
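A sketch of the three pieces, continuing from the cells above. The column names (question, answer), the exact-match scoring, and the placeholder model are assumptions for illustration; in recent Weave versions, scorers receive the model result as an output argument alongside any dataset columns whose names match the scorer's parameters:

```python
import weave

@weave.op()
def preprocess_example(example: dict) -> dict:
    # Resolve the index entry back into a real dataset row. The keys
    # returned here become the keyword arguments of the evaluated function.
    row = hf_dataset[example["hf_id"]]
    return {"question": row["question"]}

@weave.op()
def hf_eval(hf_id: int, output: str) -> dict:
    # Score the model output against the reference answer for the same row.
    expected = hf_dataset[hf_id]["answer"]
    return {"exact_match": output.strip() == expected.strip()}

@weave.op()
def function_to_evaluate(question: str) -> str:
    # Stand-in "model": a real setup would call an LLM here.
    return "placeholder answer"
```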
Create and run evaluation
For each index entry in hf_index:
- preprocess_example gets the corresponding data from the HF dataset.
- The preprocessed data is passed to function_to_evaluate.
- The output is scored using hf_eval.
- Results are tracked in Weave.
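Putting it together, under the same assumptions as the earlier sketches (note that Evaluation.evaluate is async):

```python
import asyncio

import weave

# The index is the evaluation dataset; preprocess_model_input resolves
# each index entry to the actual model inputs before the model is called.
evaluation = weave.Evaluation(
    dataset=hf_index,
    scorers=[hf_eval],
    preprocess_model_input=preprocess_example,
)

# In a notebook you could `await evaluation.evaluate(...)` directly.
asyncio.run(evaluation.evaluate(function_to_evaluate))
```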